Meta trained its latest AI model, Llama 3.1, on 16,384 GPUs, showcasing the astonishing pace of AI advancement. During the process, however, a failure occurred on average once every 3 hours, 419 in total, with approximately half traced to the H100 GPUs and their HBM3 memory. These figures reveal the reliability challenges supercomputing systems face in their pursuit of performance breakthroughs. The Llama 3.1 training cluster is as intricate as a small city's neural network, and failures were frequent. The Meta team has implemented strategies such as reducing...
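The "one failure every 3 hours" figure can be sanity-checked with simple arithmetic. A minimal sketch, assuming the roughly 54-day training window Meta reported for this run (the 419-failure count comes from the text above):

```python
# Back-of-the-envelope check of the reported failure rate.
# Assumption: the run lasted ~54 days, as reported by Meta;
# the 419-failure total is taken from the article.
training_days = 54
failures = 419

total_hours = training_days * 24          # 1296 hours of training
mtbf_hours = total_hours / failures       # mean time between failures

print(f"~{mtbf_hours:.1f} hours between failures")  # ~3.1 hours
```

At this scale, even highly reliable individual components add up: with 16,384 GPUs, a per-GPU failure rate that would be negligible on a single workstation becomes a cluster-wide interruption several times a day.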